Wasteful advertising spending poses a significant challenge for many businesses: they invest substantial amounts in advertising without achieving the expected results. A primary cause of this problem is misidentifying the target customer base, which leads to inefficient allocation of advertising budgets. Businesses therefore need solutions that accurately identify customers with high purchasing potential and optimize costs for each product.
All customer information contains valuable insights into shopping habits and personal preferences. Customer segmentation projects enable businesses to understand specific characteristics of various customer groups better. This not only aids in crafting smarter marketing strategies but also enhances customer experiences and fosters better interactions with the market.
The dataset describes the purchasing activity of a company's customers. It comprises 29 columns detailing customers' preferences, shopping habits, and demographics, grouped into the five categories shown in the tables below.
| # | Attribute | Description |
|---|---|---|
| 1 | ID | Customer unique identifier code. |
| 2 | Year_Birth | Customer's birth year. |
| 3 | Education | Customer's educational level. |
| 4 | Marital_Status | Customer's marital status. |
| 5 | Income | Yearly household income of the customer. |
| 6 | Kidhome | Number of children in the customer's family. |
| 7 | Teenhome | Number of teenagers in the customer's family. |
| 8 | Dt_Customer | Registration date of the customer with the company. |
| 9 | Recency | Days since the customer's most recent purchase. |
| 10 | Complain | 1 if the customer has complained in the past 2 years; 0 if not. |
| # | Attribute | Description |
|---|---|---|
| 11 | MntWines | Amount spent on wine in the last 2 years. |
| 12 | MntFruits | Amount spent on fruits in the last 2 years. |
| 13 | MntMeatProducts | Amount spent on meat products in the last 2 years. |
| 14 | MntFishProducts | Amount spent on fish products in the last 2 years. |
| 15 | MntSweetProducts | Amount spent on sweets in the last 2 years. |
| 16 | MntGoldProds | Amount spent on gold products in the last 2 years. |
| # | Attribute | Description |
|---|---|---|
| 17 | NumDealsPurchases | Number of discounted deals the customer has purchased. |
| 18 | AcceptedCmp1 | 1 if the customer accepted the offer in the first marketing campaign, otherwise 0. |
| 19 | AcceptedCmp2 | 1 if the customer accepted the offer in the second marketing campaign, otherwise 0. |
| 20 | AcceptedCmp3 | 1 if the customer accepted the offer in the third marketing campaign, otherwise 0. |
| 21 | AcceptedCmp4 | 1 if the customer accepted the offer in the fourth marketing campaign, otherwise 0. |
| 22 | AcceptedCmp5 | 1 if the customer accepted the offer in the fifth marketing campaign, otherwise 0. |
| 23 | Response | 1 if the customer accepted the offer in the last marketing campaign, otherwise 0. |
| # | Attribute | Description |
|---|---|---|
| 24 | NumWebPurchases | Number of purchases made through the company's website. |
| 25 | NumCatalogPurchases | Number of purchases made using catalogs. |
| 26 | NumStorePurchases | Number of purchases made directly at the company's stores. |
| 27 | NumWebVisitsMonth | Number of visits to the company's website in the past month. |
| # | Attribute | Description |
|---|---|---|
| 28 | Z_CostContact | Constant value (3) for every row; carries no customer information. |
| 29 | Z_Revenue | Constant value (11) for every row; carries no customer information. |
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from pandas import get_dummies
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, LabelEncoder
from yellowbrick.cluster import KElbowVisualizer
import warnings
warnings.filterwarnings('ignore')
df_initial = pd.read_csv(r'D:\WORKSPACE\PYTHON\Project\Customer Segmentation in Python\Marketing_Campaign\marketing_campaign.csv', sep='\t')  # raw string avoids backslash-escape issues in Windows paths
df_initial
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2235 | 10870 | 1967 | Graduation | Married | 61223.0 | 0 | 1 | 13-06-2013 | 46 | 709 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2236 | 4001 | 1946 | PhD | Together | 64014.0 | 2 | 1 | 10-06-2014 | 56 | 406 | ... | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 11 | 0 |
| 2237 | 7270 | 1981 | Graduation | Divorced | 56981.0 | 0 | 0 | 25-01-2014 | 91 | 908 | ... | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2238 | 8235 | 1956 | Master | Together | 69245.0 | 0 | 1 | 24-01-2014 | 8 | 428 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2239 | 9405 | 1954 | PhD | Married | 52869.0 | 1 | 1 | 15-10-2012 | 40 | 84 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
2240 rows × 29 columns
df_initial.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Z_CostContact        2240 non-null   int64
 27  Z_Revenue            2240 non-null   int64
 28  Response             2240 non-null   int64
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
From the output above, we can note the following:
The dataset has 2,240 rows and 29 columns.
Income is the only column with missing values (2,216 non-null out of 2,240 entries).
Dt_Customer is stored as a plain object (string) and must be converted to a datetime type.
Check the impact of the missing data on the dataset.
feature_cols = df_initial.columns
print(feature_cols)
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
dtype='object')
def display_missing(df, feature_cols):
    n_rows = df.shape[0]  # use the frame passed in, not the global df_initial
    for col in feature_cols:
        missing_count = df[col].isnull().sum()
        if missing_count > 0:
            print(f"{col} has {missing_count*100/n_rows:.2f}% missing values.")
display_missing(df_initial, feature_cols)
Income has 1.07% missing values.
Only the Income column contains missing values, accounting for 1.07% of the data (24 rows). Since this is a negligible share of the dataset, these rows can safely be removed.
df_initial.dropna(inplace=True)
print("The total number of data-points after removing the rows with missing values are:", len(df_initial))
The total number of data-points after removing the rows with missing values are: 2216
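Dropping is a reasonable choice at roughly 1% missing. If we preferred to keep every row, a common alternative is median imputation; the sketch below is illustrative only (the helper name `impute_median` and the toy data are ours, not part of the notebook):

```python
import pandas as pd

def impute_median(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a copy of df with missing values in `col` filled by the column median."""
    out = df.copy()
    out[col] = out[col].fillna(out[col].median())
    return out

# Toy frame with one missing Income value
toy = pd.DataFrame({"Income": [40000.0, 60000.0, None, 80000.0]})
filled = impute_median(toy, "Income")  # NaN replaced by 60000.0, the median
```

On the real data this would amount to `df_initial['Income'] = df_initial['Income'].fillna(df_initial['Income'].median())`.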
Reformat the data type of the Dt_Customer column.
df_initial['Dt_Customer'] = pd.to_datetime(df_initial['Dt_Customer'], format='%d-%m-%Y')
print("The oldest record on customer's enrollment:", min(df_initial['Dt_Customer']).date())
print("The newest record on customer's enrollment:", max(df_initial['Dt_Customer']).date())
The oldest record on customer's enrollment: 2012-07-30
The newest record on customer's enrollment: 2014-06-29
The results indicate that the oldest registration was on July 30, 2012, and the most recent was on June 29, 2014.
Now we will explore the unique values in the categorical features to have a clearer idea about the data.
print("Total categories in the feature Marital_Status:\n", df_initial["Marital_Status"].value_counts())
print("Total categories in the feature Education:\n", df_initial["Education"].value_counts())
Total categories in the feature Marital_Status:
 Marital_Status
Married     857
Together    573
Single      471
Divorced    232
Widow        76
Alone         3
Absurd        2
YOLO          2
Name: count, dtype: int64
Total categories in the feature Education:
 Education
Graduation    1116
PhD            481
Master         365
2n Cycle       200
Basic           54
Name: count, dtype: int64
Attribute Marital_Status: Single, Divorced, Widow, Absurd, and YOLO can be represented by the value Alone; Married and Together can be represented by Partner.
Attribute Education: there are 5 distinct values, which can be consolidated into three groups: Undergraduate (representing Basic), Graduate (representing Graduation), and Postgraduate (representing Master, PhD, and 2n Cycle).
#Segmenting education levels in three groups
df_initial['Education'] = df_initial['Education'].replace({'Graduation': 'Graduate', 'PhD': 'Postgraduate', 'Master': 'Postgraduate', '2n Cycle': 'Postgraduate', 'Basic': 'Undergraduate'})
df_initial['Marital_Status'] = df_initial['Marital_Status'].replace({'Married': 'Partner', 'Together': 'Partner', 'Single': 'Alone', 'Divorced': 'Alone', 'Widow': 'Alone', 'Absurd': 'Alone', 'YOLO': 'Alone'})
In the next section, we design several new features:
Age: the customer's age, computed from Year_Birth relative to 2015, the year closest to when the dataset was collected.
Day, Dayofweek, Month, and Year: extracted from the existing Dt_Customer attribute.
Total_Children: the total number of children in the household, i.e. Kidhome plus Teenhome.
Is_Parent: parent status; set to 1 if Total_Children is 1 or more, otherwise 0.
Total_Spent: the total amount spent by the customer across all product categories over the two-year period.
#Age of customer today
df_initial['Age'] = 2015 - df_initial['Year_Birth']
# Create a new column from Dt_Customer column
df_initial['Day'] = df_initial['Dt_Customer'].apply(lambda x: x.day)
df_initial['Dayofweek'] = df_initial['Dt_Customer'].apply(lambda x: x.day_name())
df_initial['Month'] = df_initial['Dt_Customer'].apply(lambda x: x.month)
df_initial['Year'] = df_initial['Dt_Customer'].apply(lambda x: x.year)
# Feature for total children in the household
df_initial['Total_Children'] = df_initial['Kidhome'] + df_initial['Teenhome']
#Feature pertaining parenthood
df_initial['Is_Parent'] = df_initial['Total_Children'].apply(lambda x: 1 if x != 0 else 0)
#Total spendings on various items
df_initial['Total_Spent'] = df_initial['MntWines'] + df_initial['MntFruits'] + df_initial['MntMeatProducts'] + df_initial['MntFishProducts'] + df_initial['MntSweetProducts'] + df_initial['MntGoldProds']
Rename the attributes and remove unnecessary columns.
# Clear description of feature
df_initial.rename(columns={'MntWines':'Wines', 'MntFruits':'Fruits', 'MntMeatProducts':'Meats', 'MntFishProducts':'Fish', 'MntSweetProducts':'Sweets', 'MntGoldProds':'Golds'}, inplace=True)
df_initial.rename(columns={'NumWebPurchases':'Web', 'NumCatalogPurchases':'Catalog', 'NumStorePurchases':'Store'}, inplace=True)
# Delete columns
df_initial.drop(['ID', 'Year_Birth', 'Dt_Customer', 'Z_CostContact', 'Z_Revenue'], axis=1, inplace=True)
We have a table with new attributes.
df = df_initial.copy()
df
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | Wines | Fruits | Meats | Fish | ... | Complain | Response | Age | Day | Dayofweek | Month | Year | Total_Children | Is_Parent | Total_Spent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Graduate | Alone | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | ... | 0 | 1 | 58 | 4 | Tuesday | 9 | 2012 | 0 | 0 | 1617 |
| 1 | Graduate | Alone | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | ... | 0 | 0 | 61 | 8 | Saturday | 3 | 2014 | 2 | 1 | 27 |
| 2 | Graduate | Partner | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | ... | 0 | 0 | 50 | 21 | Wednesday | 8 | 2013 | 0 | 0 | 776 |
| 3 | Graduate | Partner | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | ... | 0 | 0 | 31 | 10 | Monday | 2 | 2014 | 1 | 1 | 53 |
| 4 | Postgraduate | Partner | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | ... | 0 | 0 | 34 | 19 | Sunday | 1 | 2014 | 1 | 1 | 422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2235 | Graduate | Partner | 61223.0 | 0 | 1 | 46 | 709 | 43 | 182 | 42 | ... | 0 | 0 | 48 | 13 | Thursday | 6 | 2013 | 1 | 1 | 1341 |
| 2236 | Postgraduate | Partner | 64014.0 | 2 | 1 | 56 | 406 | 0 | 30 | 0 | ... | 0 | 0 | 69 | 10 | Tuesday | 6 | 2014 | 3 | 1 | 444 |
| 2237 | Graduate | Alone | 56981.0 | 0 | 0 | 91 | 908 | 48 | 217 | 32 | ... | 0 | 0 | 34 | 25 | Saturday | 1 | 2014 | 0 | 0 | 1241 |
| 2238 | Postgraduate | Partner | 69245.0 | 0 | 1 | 8 | 428 | 30 | 214 | 80 | ... | 0 | 0 | 59 | 24 | Friday | 1 | 2014 | 1 | 1 | 843 |
| 2239 | Postgraduate | Partner | 52869.0 | 1 | 1 | 40 | 84 | 3 | 61 | 2 | ... | 0 | 1 | 61 | 15 | Monday | 10 | 2012 | 2 | 1 | 172 |
2216 rows × 32 columns
sns.color_palette('copper_r')  # preview the color palette used throughout this notebook
numeric_cols = df.select_dtypes(include=['float','int']).columns
def _plot_numeric_classes(df, col, bins=10, kde=True):
    # sns.distplot was removed in recent seaborn; histplot(kde=True) is the modern equivalent
    sns.histplot(df[col], bins=bins, kde=kde)

def _distribution_numeric(df, numeric_cols, row=6, col=5, figsize=(40,30), bins=20):
    print('Number of numeric columns:', len(numeric_cols))
    assert row * col >= len(numeric_cols), 'grid is too small for the number of columns'
    plt.figure(figsize=figsize)
    plt.subplots_adjust(wspace=0.2, hspace=0.5)
    for i, name in enumerate(numeric_cols, start=1):
        plt.subplot(row, col, i)
        _plot_numeric_classes(df, name, bins=bins)
        plt.title(name)

_distribution_numeric(df, numeric_cols)
Number of numeric columns: 29
outliers_column = ["Age", "Income", "Total_Spent"]
def _boxplot_outliers(df, col, color = 'steelblue'):
for i in range (1, len(col)+1,1):
plt.subplot(len(col),1,i)
plt.title(col[i-1])
sns.boxplot(x=df[col[i-1]], color=color)
plt.tight_layout()
plt.show()
_boxplot_outliers(df, outliers_column)
print('Income:')
print(df['Income'].sort_values().tail(5))
print('\nAge:')
print(df['Age'].sort_values().tail(5))
Income:
164     157243.0
1300    157733.0
687     160803.0
617     162397.0
2233    666666.0
Name: Income, dtype: float64

Age:
424      74
1950     75
192     115
339     116
239     122
Name: Age, dtype: int64
The income of 666,666 and ages above 100 (from birth years as early as 1893) are clearly implausible and would significantly distort scaling and clustering. Therefore, these records are removed from the dataset.
#Dropping the outliers by setting a cap on Age and removing the extreme income row
df = df[(df["Age"] < 80)]
df = df.drop(2233)  # the row with the 666,666 income value
print("The total number of data-points after removing the outliers are:", len(df))
The total number of data-points after removing the outliers are: 2212
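The caps above were chosen by inspecting the sorted tails. A more systematic alternative is the 1.5×IQR rule, sketched below under our own helper name `iqr_bounds` on toy data; note its fences will not exactly match the manual caps used in this notebook:

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the k*IQR outlier rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy series: 200 is an obvious outlier
s = pd.Series([30, 35, 40, 45, 50, 200])
lo, hi = iqr_bounds(s)
mask = s.between(lo, hi)  # True for in-range values only
```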
# List of columns to be used for the pairplot
data = ['Age', 'Income', 'Total_Spent', 'Recency', 'Is_Parent']
# Create the pairplot and use 'Is_Parent' for coloring points, 'copper_r' palette is used
plot = sns.pairplot(df[data], hue='Is_Parent', palette='copper_r')
# Add a title to the figure
plot.fig.suptitle('Feature Relationship', y=1.05, weight='bold', fontsize=16)
Text(0.5, 1.05, 'Feature Relationship')
We observe that parents tend to spend less money compared to those who are not parents.
The spending trend for products is proportional to the customer's income level.
fig, axes = plt.subplots(2, 2, figsize=(25,8))
axes = axes.flatten()
fig.suptitle("When Did the Customer Enroll as a Member?", weight='bold', fontsize=16)
times = ['Dayofweek', 'Day', 'Month', 'Year']
def _times_registration(df, times, palette='copper_r'):
    for i in range(len(times)):
        if df[times[i]].dtype == 'O':  # string columns: keep their original category order
            sns.countplot(x=df[times[i]], order=list(df[times[i]].unique()), palette=palette, ax=axes[i])
        else:
            sns.countplot(x=df[times[i]], palette=palette, ax=axes[i])
_times_registration(df, times)
Registrations are spread fairly evenly across the days of the month, with the 12th seeing the most sign-ups.
By day of the week, registration counts are likewise consistent, averaging about 71 new members per day.
March, May, August, and October saw comparable registration volumes, each with over 200 new members. Registrations also pick up toward the beginning and end of the year, possibly because these periods coincide with holidays and increased demand.
The year 2013 contributes the most members simply because it is the only fully covered year in this dataset; since data collection is incomplete for 2012 and 2014, registrations cannot be meaningfully compared across years.
# Age
plt.figure(figsize=(20,5))
plt.title('Customers Age Distribution', weight='bold', fontsize=16)
sns.countplot( x = df["Age"], palette='copper_r')
<Axes: title={'center': 'Customers Age Distribution'}, xlabel='Age', ylabel='count'>
Most current customers fall into the young-to-middle-aged segment (ages 36 to 50), ranging from those preparing to start a family to those who already have one.
They are financially independent individuals, and this segment represents the largest revenue source for the business.
# Income
plt.figure(figsize=(25,5))
# sns.displot is figure-level and would ignore the figure above (leaving it empty);
# histplot is the axes-level equivalent and draws into the current figure
sns.histplot(x=df["Income"], kde=True, color='#c08552')
plt.title("Customer Income Distribution", weight='bold', fontsize=16)
Text(0.5, 1.0, 'Customer Income Distribution')
The income of the customers ranges from roughly 20,000 USD to 100,000 USD per year.
The bulk of customers earn between 30,000 USD and 80,000 USD.
# Education
plt.title('Education Levels Among Customers', weight='bold', fontsize=10)
sns.countplot( x=df["Education"], palette='copper_r')
<Axes: title={'center': 'Education Levels Among Customers'}, xlabel='Education', ylabel='count'>
# Marital Status
status = df['Marital_Status'].value_counts()
Alone = status['Alone']
Partner = status['Partner']
plt.title('Customer Marital Status', weight='bold', fontsize=10)
# Pie chart
palette = sns.color_palette('copper_r')
plt.pie([Alone, Partner], labels = ['Alone', 'Partner'], colors = palette, autopct='%.0f%%');
group = pd.cut(df['Age'], [10, 20, 30, 40, 50, 60, 70, 80])
counts = group.value_counts()
plt.figure(figsize=(8, 6))
ax = counts.plot(kind='bar', color='#f4a460')
plt.xlabel('Age Groups', fontsize=12)
plt.ylabel('Counts', fontsize=12)
plt.title('Distributed By Age Group', fontsize=16, weight='bold')
for i, v in enumerate(counts):
ax.text(i, v + 0.1, str(v), ha='center', va='bottom', fontsize=10)
plt.show()
df2 = df.copy()
df2['Age'] = group
sum_group = df2[['Total_Spent', 'Age']].groupby('Age').sum()
mean_group = df2[['Total_Spent', 'Age']].groupby('Age').mean()
fig, axes = plt.subplots(1,2,figsize=(14,8))
axes = axes.flatten()
sns.barplot(x = sum_group['Total_Spent'], y = sum_group.index, palette='copper_r', ci=None, orient='h', ax=axes[0])
axes[0].set_title('Total Spent on Products\nby Age Groups', weight='bold', fontsize=16)
for i,v in enumerate(sum_group['Total_Spent']):
if i == 0 or i == 6:
axes[0].text(v+30000, i, '$ {}'.format(v), horizontalalignment='center', verticalalignment='center', weight='bold', color='black', fontsize=12)
else:
axes[0].text(v-40000, i, '$ {}'.format(v), horizontalalignment='center', verticalalignment='center', weight='bold', color='white', fontsize=12)
sns.barplot(x=mean_group['Total_Spent'], y=mean_group.index, palette='copper_r', ci=None, orient='h', ax=axes[1])
axes[1].set_title('Average Spent on Products\nby Age Groups', weight='bold', fontsize=16)
for i,v in enumerate(mean_group['Total_Spent']):
axes[1].text(v-130, i, '$ {}'.format(round(v,2)), horizontalalignment='center', verticalalignment='center', weight='bold', color='white', fontsize=12)
Customers aged between 70-80, despite being a small portion of buyers, are willing to spend significantly more than other groups on food items such as wine, fruits, meat, and fish.
It is not surprising that the youngest age group, aged 10-20, spends heavily on sweets.
Among the financially independent age groups, there's consistent demand for gold across all, especially within the 60-70 age bracket (though other age groups from 20-60 are not far behind). They tend to accumulate and preserve wealth in the form of gold. They might also be investing in gold for profit.
Overall, the revenue generated from sales indicates that a significant portion comes from the sale of wine and meat.
fig, axd = plt.subplot_mosaic([[0,1,2],[3,4,5], [6,6,7], [6,6,7], [6,6,7]],
constrained_layout=True, figsize=(18,10))
#fig.suptitle("Customer's Average Spent on Products\nby Age Groups", weight='bold', fontsize=20)
# Bar plot
products = ['Wines','Fruits','Meats','Fish','Sweets','Golds']
def _barplot(df, col):
for i in range(len(col)):
sns.barplot(df, x=group, y=col[i], palette='copper_r', ci=None, ax=axd[i])
axd[i].set_title(col[i], weight='bold')
_barplot(df, products)
for i in range(6):
for p in axd[i].patches:
axd[i].annotate(format(p.get_height(), '.2f'), (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 9), textcoords='offset points', fontsize=10)
# Pie chart
data = df[['Wines', 'Fruits', 'Meats', 'Fish', 'Sweets', 'Golds']].sum().sort_values()
palette = sns.color_palette('copper')
wedges, texts, autotexts = axd[6].pie(x=data, labels=data.index, autopct='%.2f%%', colors=palette, textprops=dict(fontsize=12))
axd[6].set_title("\n\nPercentage of Company's Revenue\nfrom Products", weight='bold', fontsize=20, x=1.2)
for autotext in autotexts:
autotext.set_color('white')
autotext.set_weight('bold')
# Text
for i, (name, value) in enumerate(zip(data.index, data)):
axd[6].text(2.3, 0.5-0.2*i, r"$\bf{" + name + "}$" + "\t:" + " \$" + str(value), fontsize=14)
axd[7].axis('off')
(0.0, 1.0, 0.0, 1.0)
The data is unevenly distributed across the groups: the youngest and oldest age groups together contain only 21 people, which explains the stark contrast in total expenditure noted above.
Because group sizes are unequal, average spending per customer is the fairer comparison. The 70-80 age group has the highest average total spend at 1,080.47 USD per customer, and the 10-20 group's average slightly exceeds that of the 40-50 group.
fig, axd = plt.subplot_mosaic([[0,1,2], [3,3,4], [3,3,4]],
constrained_layout=True, figsize=(18,8))
fig.suptitle("Average Number of Purchases Made\nThrough Different Methods by Age Groups", weight='bold', fontsize=20)
custom_ylim = (0, 8)
plt.setp(axd[0], ylim=custom_ylim)
plt.setp(axd[1], ylim=custom_ylim)
# Bar plot
channel = ['Web', 'Catalog', 'Store']
_barplot(df, channel)
# Pie chart
data = df[['Web', 'Catalog', 'Store']].sum().sort_values()
palette = sns.color_palette('copper')
wedges, texts, autotexts = axd[3].pie(x=data, labels=data.index, autopct='%.2f%%', colors=palette, textprops=dict(fontsize=12));
axd[3].set_title('\n\nPercentage of Purchases Made\nThrough Different Methods', weight='bold', fontsize=20, x=1.35)
for autotext in autotexts:
autotext.set_color('white')
autotext.set_weight('bold')
# Text
for i, (name, value) in enumerate(zip(data.index, data)):
axd[3].text(2.3, 0.3-0.2*i, r"$\bf{" + name + "}$" + "\t:" + str(value) + " times", fontsize=14)
axd[4].axis('off')
(0.0, 1.0, 0.0, 1.0)
Average purchase counts through the website, catalog, and physical stores are all highest among individuals aged over 70. Even though this age group contained only 15 people in the earlier distribution, their per-person purchasing volume far exceeds that of the other age groups.
A significant number of orders are executed in physical stores, accounting for 46.20% of the company's total purchases.
plt.figure(figsize=(19,7))
plt.subplot(1, 2, 1)
plt.title('Average Number of Purchases Made with a Discount\nby Age Groups', weight='bold', fontsize=12)
ax = sns.barplot(data=df, x=group, y='NumDealsPurchases', hue='Is_Parent', ci=None, palette='copper')
plt.subplot(1, 2, 2)
plt.title('Ratio of Discounted Purchases to Normal Purchases Per Education Level', weight='bold', fontsize=12)
discounted_purchases = df.groupby('Education').NumDealsPurchases.sum()
total_purchases = df.groupby('Education')[['Web', 'Catalog', 'Store']].sum().sum(axis=1)
percentage = round(discounted_purchases * 100 / total_purchases, 2).sort_values(ascending=False)
ax = sns.barplot(y=percentage.index, x=percentage.values,palette='copper')
plt.show()
df.rename(columns = {'Response':'AcceptedCmp6'},inplace=True)
plt.figure(figsize=(10,4))
plt.title('Percentage of Customers Who Accepted the nth Offer', weight='bold', fontsize=16)
percent = df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp6']].sum()*100/len(df)
ax = percent.plot.bar(color='#c08552')
percent.plot(style='o-', colormap='copper')
plt.setp(ax, ylim=(0, 18))
for i,v in enumerate(percent):
plt.text(i, v+1, '{:.2f}%'.format(v), horizontalalignment='center', weight='bold', color='Black', fontsize=10)
The direct acceptance rate of offers in the first campaign was only 6.42%. The figure was even lower in the second campaign with only 1.36% of customers.
The most recent campaign successfully attracted the highest number of customers compared to previous campaigns with a customer rate of 15.05%.
data = df.select_dtypes(include=[np.number]).columns
plt.figure(figsize=(12,10))
plt.title('Feature correlation', weight='bold', fontsize=16, y=1.05)
sns.heatmap(df[data].corr(), cmap='copper')
<Axes: title={'center': 'Feature correlation'}>
The total number of children and the status of being a parent have a low correlation with the customer's total expenditure.
As expected, income and all products are the most influential characteristics affecting the total expenditure.
# Select columns with non-numeric data types
obj_feat = df.select_dtypes(exclude=[np.number]).columns
# Create dummy variables
dummies = get_dummies(df[obj_feat], drop_first=True)
df_final = pd.concat([df, dummies], axis=1)
df_final.drop(obj_feat, axis=1, inplace=True)
df_final.shape
(2212, 38)
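A note on `drop_first=True`: it drops one indicator column per categorical feature, since the remaining columns already determine the dropped level (avoiding perfectly collinear dummies). A minimal illustration on a toy frame:

```python
import pandas as pd

s = pd.DataFrame({"Marital_Status": ["Alone", "Partner", "Alone"]})
full = pd.get_dummies(s)                      # one indicator column per level
reduced = pd.get_dummies(s, drop_first=True)  # 'Alone' dropped; 'Partner' alone is sufficient
```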
# Normalizes the DataFrame's numerical features. Ensuring that they have a mean close to 0 and a standard deviation close to 1.
scaler = StandardScaler()
scaled = scaler.fit_transform(df_final)
df_final_scaled = pd.DataFrame(scaled, columns=df_final.columns)
df_final_scaled.head()
| Income | Kidhome | Teenhome | Recency | Wines | Fruits | Meats | Fish | Sweets | Golds | ... | Total_Spent | Education_Postgraduate | Education_Undergraduate | Marital_Status_Partner | Dayofweek_Monday | Dayofweek_Saturday | Dayofweek_Sunday | Dayofweek_Thursday | Dayofweek_Tuesday | Dayofweek_Wednesday | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.287105 | -0.822754 | -0.929699 | 0.310353 | 0.977660 | 1.552041 | 1.690293 | 2.453472 | 1.483713 | 0.852576 | ... | 1.676245 | -0.944572 | -0.158187 | -1.349603 | -0.423207 | -0.412759 | -0.396874 | -0.401441 | 2.616099 | -0.423207 |
| 1 | -0.260882 | 1.040021 | 0.908097 | -0.380813 | -0.872618 | -0.637461 | -0.718230 | -0.651004 | -0.634019 | -0.733642 | ... | -0.963297 | -0.944572 | -0.158187 | -1.349603 | -0.423207 | 2.422719 | -0.396874 | -0.401441 | -0.382249 | -0.423207 |
| 2 | 0.913196 | -0.822754 | -0.929699 | -0.795514 | 0.357935 | 0.570540 | -0.178542 | 1.339513 | -0.147184 | -0.037254 | ... | 0.280110 | -0.944572 | -0.158187 | 0.740959 | -0.423207 | -0.412759 | -0.396874 | -0.401441 | -0.382249 | 2.362908 |
| 3 | -1.176114 | 1.040021 | -0.929699 | -0.795514 | -0.872618 | -0.561961 | -0.655787 | -0.504911 | -0.585335 | -0.752987 | ... | -0.920135 | -0.944572 | -0.158187 | 0.740959 | 2.362908 | -0.412759 | -0.396874 | -0.401441 | -0.382249 | -0.423207 |
| 4 | 0.294307 | 1.040021 | -0.929699 | 1.554453 | -0.392257 | 0.419540 | -0.218684 | 0.152508 | -0.001133 | -0.559545 | ... | -0.307562 | 1.058681 | -0.158187 | 0.740959 | -0.423207 | -0.412759 | 2.519690 | -0.401441 | -0.382249 | -0.423207 |
5 rows × 38 columns
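As a quick sanity check (not part of the original notebook): after StandardScaler, every column should have mean close to 0 and population standard deviation (ddof=0) close to 1. Demonstrated on a toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

assert np.allclose(scaled.mean(), 0.0)
assert np.allclose(scaled.std(ddof=0), 1.0)  # StandardScaler uses the population std
```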
# Label-encode the categorical columns as integers.
# obj_feat[:-1] skips the last object column (Dayofweek), which is kept as text.
le = LabelEncoder()
for obj in obj_feat[:-1]:
    df[obj] = le.fit_transform(df[obj])
The Elbow method is one of the best-known techniques for determining the optimal number of clusters. For each candidate k, KMeans computes the SSE (sum of squared errors): the total squared Euclidean distance from each point to its assigned centroid. SSE always decreases as k increases, so ideally we pick the smallest k that still yields a low SSE, i.e. the "elbow" of the curve.
The value of k can vary due to the random initialization of cluster centroids in KMeans Clustering. To make it consistent throughout this notebook, we set the random state to 123.
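The elbow plot below relies on yellowbrick, but the quantity it plots is just KMeans' `inertia_` (the SSE). A minimal sketch of computing it directly, on synthetic blob data rather than this dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=123)
sse = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=123).fit(X)
    sse[k] = km.inertia_  # total squared distance of samples to their closest centroid

# SSE keeps shrinking as k grows; the "elbow" is where the decrease flattens out
```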
elbow = KElbowVisualizer(KMeans(random_state=123), k=10)
elbow.fit(df_final_scaled)
elbow.show()
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
model = KMeans(n_clusters=5, random_state=123)
yhat = model.fit_predict(df_final_scaled)
df['Cluster'] = yhat
df
| Education | Marital_Status | Income | Kidhome | Teenhome | Recency | Wines | Fruits | Meats | Fish | ... | AcceptedCmp6 | Age | Day | Dayofweek | Month | Year | Total_Children | Is_Parent | Total_Spent | Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | ... | 1 | 58 | 4 | Tuesday | 9 | 2012 | 0 | 0 | 1617 | 0 |
| 1 | 0 | 0 | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | ... | 0 | 61 | 8 | Saturday | 3 | 2014 | 2 | 1 | 27 | 1 |
| 2 | 0 | 1 | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | ... | 0 | 50 | 21 | Wednesday | 8 | 2013 | 0 | 0 | 776 | 0 |
| 3 | 0 | 1 | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | ... | 0 | 31 | 10 | Monday | 2 | 2014 | 1 | 1 | 53 | 1 |
| 4 | 1 | 1 | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | ... | 0 | 34 | 19 | Sunday | 1 | 2014 | 1 | 1 | 422 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2235 | 0 | 1 | 61223.0 | 0 | 1 | 46 | 709 | 43 | 182 | 42 | ... | 0 | 48 | 13 | Thursday | 6 | 2013 | 1 | 1 | 1341 | 2 |
| 2236 | 1 | 1 | 64014.0 | 2 | 1 | 56 | 406 | 0 | 30 | 0 | ... | 0 | 69 | 10 | Tuesday | 6 | 2014 | 3 | 1 | 444 | 2 |
| 2237 | 0 | 0 | 56981.0 | 0 | 0 | 91 | 908 | 48 | 217 | 32 | ... | 0 | 34 | 25 | Saturday | 1 | 2014 | 0 | 0 | 1241 | 0 |
| 2238 | 1 | 1 | 69245.0 | 0 | 1 | 8 | 428 | 30 | 214 | 80 | ... | 0 | 59 | 24 | Friday | 1 | 2014 | 1 | 1 | 843 | 2 |
| 2239 | 1 | 1 | 52869.0 | 1 | 1 | 40 | 84 | 3 | 61 | 2 | ... | 1 | 61 | 15 | Monday | 10 | 2012 | 2 | 1 | 172 | 1 |
2212 rows × 33 columns
sns.countplot(data=df, x='Cluster', palette='copper_r')
<Axes: xlabel='Cluster', ylabel='count'>
fig, axd = plt.subplot_mosaic([[0,0],[1,2]], constrained_layout=True, figsize=(14,8))
fig.suptitle('Income vs Total Spent', weight='bold', fontsize=16)
sns.scatterplot(data=df, x='Income', y='Total_Spent', hue='Cluster', palette='copper_r', ax=axd[0])
sns.violinplot(data=df, x='Cluster', y='Total_Spent', palette='copper_r', ax=axd[1])
sns.violinplot(data=df, x='Cluster', y='Income', palette='copper_r', ax=axd[2])
<Axes: label='2', xlabel='Cluster', ylabel='Income'>
We can see that:
Cluster 0: high income and expenditure.
Cluster 1: low income and expenditure.
Cluster 2: average income and expenditure.
Cluster 3: the lowest income and expenditure.
Cluster 4: the highest income and expenditure.
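The ranking read off the violin plots can also be checked numerically with a groupby. The column names follow the notebook (`Income`, `Total_Spent`, `Cluster`); the toy frame below is illustrative only, standing in for the real `df`:

```python
# Sketch: per-cluster mean income and spend, the numbers behind the plots.
# Toy data — replace df_toy with the notebook's df.
import pandas as pd

df_toy = pd.DataFrame({
    'Cluster':     [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    'Income':      [70000, 72000, 30000, 32000, 50000, 52000, 15000, 16000, 90000, 95000],
    'Total_Spent': [1500, 1600, 100, 120, 600, 650, 40, 50, 2000, 2100],
})

profile = (df_toy.groupby('Cluster')[['Income', 'Total_Spent']]
                 .mean()
                 .sort_values('Total_Spent', ascending=False))
print(profile)
```

Sorting by mean spend makes the cluster ordering explicit instead of relying on a visual comparison.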
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.kdeplot, 'Age', 'Total_Spent', color='#c08552', fill=True)
g.fig.suptitle('Age and Total Spent by Cluster', fontsize=9, weight='bold')
Text(0.5, 0.98, 'Age and Total Spent by Cluster')
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.kdeplot, 'Is_Parent', 'Total_Spent', color='#c08552', fill=True)
plt.text(0.6,3900, '0: Non Parent\n1: Parent', weight='bold', fontsize=12)
Text(0.6, 3900, '0: Non Parent\n1: Parent')
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.kdeplot, 'Teenhome', 'Total_Spent', color='#c08552', fill=True)
<seaborn.axisgrid.FacetGrid at 0x25727dc2910>
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.kdeplot, 'Kidhome', 'Total_Spent', color='#c08552', fill=True)
<seaborn.axisgrid.FacetGrid at 0x257274e6c90>
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.kdeplot, 'Total_Children', 'Total_Spent', color='#c08552', fill=True)
<seaborn.axisgrid.FacetGrid at 0x2572a3a3f10>
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.kdeplot, 'Marital_Status', 'Total_Spent', color='#c08552', fill=True)
plt.text(0.7,3900, '0: Alone\n1: Partner', weight='bold', fontsize=12)
Text(0.7, 3900, '0: Alone\n1: Partner')
g = sns.FacetGrid(data=df, col='Cluster')
g.map(sns.scatterplot, 'Education', 'Total_Spent', color='#c08552')
plt.text(0.6,3400, '0: Graduate\n1: Postgraduate\n2: Undergraduate', weight='bold', fontsize=12)
Text(0.6, 3400, '0: Graduate\n1: Postgraduate\n2: Undergraduate')
Characteristics of each cluster:
Cluster 0: high income and spending; among the heaviest consumers across product categories.
Cluster 1: low income and low spending.
Cluster 2: average income and spending; gold products are the most purchased.
Cluster 3: the lowest income and spending, with almost no wine purchases.
Cluster 4: the highest income and spending; the strongest response to marketing campaigns.
plt.figure(figsize=(9,4))
plt.title("Average Number of Days Since Customer's Last Purchase\nby Clusters", weight='bold', fontsize=16)
sns.barplot(data=df, x='Cluster', y='Recency', palette='copper_r', ci=None)
<Axes: title={'center': "Average Number of Days Since Customer's Last Purchase\nby Clusters"}, xlabel='Cluster', ylabel='Recency'>
All segments show a similar average of around 50 days, meaning no cluster repurchases noticeably more often than the others. This points to areas where the business may be falling short in retaining customers, such as:
Product quality and customer service.
Ineffective marketing strategies.
Failure to stimulate more frequent shopping.
Increasing market competition.
fig, axes = plt.subplots(2,3, figsize=(16,8))
fig.suptitle("Customer's Average Spent on Products\nby Clusters", weight='bold', fontsize=20)
axes = axes.flatten()
def _barplot_cluster(df, cols, x='Cluster', palette='copper_r', ci=None):
    # Draw one bar plot per column onto the module-level `axes` array.
    for i, col in enumerate(cols):
        sns.barplot(data=df, x=x, y=col, palette=palette, ci=ci, ax=axes[i])
        axes[i].set_title(col, weight='bold')
_barplot_cluster(df, products)
plt.tight_layout()
Cluster 0: ranks near the top in consumption across items.
Cluster 1: consumes the least among the segments.
Cluster 2: has average consumption, with gold products being the most commonly purchased.
Cluster 3: has low consumption levels, with almost no wine purchases.
Cluster 4: has the highest consumption levels across items.
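These product-level observations are again just per-cluster means. A sketch of the underlying computation, with a toy frame standing in for `df` and a two-column stand-in for the notebook's `products` list (the real column names, e.g. for gold, may differ):

```python
# Sketch: mean spend per product category per cluster.
# Toy data and an abbreviated product list — illustrative only.
import pandas as pd

products = ['Wines', 'Gold']  # stand-in for the notebook's full product list
df_toy = pd.DataFrame({
    'Cluster': [2, 2, 3, 3],
    'Wines':   [50, 60, 0, 1],
    'Gold':    [80, 90, 5, 4],
})

by_product = df_toy.groupby('Cluster')[products].mean()
# idxmax picks each cluster's top category, e.g. gold for cluster 2.
top_for_cluster2 = by_product.loc[2].idxmax()
```

`idxmax` per row is a quick way to label each segment with its dominant product category.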
fig, axes = plt.subplots(1,3, figsize=(16,5))
fig.suptitle("Average Number of Purchases Made\nThrough Different Methods by Clusters", weight='bold', fontsize=16)
axes = axes.flatten()
custom_ylim = (0, 8)
plt.setp(axes[0], ylim=custom_ylim)
plt.setp(axes[1], ylim=custom_ylim)
# Bar plot
_barplot_cluster(df, channel)
plt.tight_layout()
Cluster 0 prioritizes shopping in physical stores the most.
Cluster 2 prefers purchasing through the website.
Both Clusters 2 and 4 prioritize shopping via catalogs.
fig, axes = plt.subplots(2,3, figsize=(16,8))
fig.suptitle("Customers' Response to Marketing Campaigns\nby Clusters", weight='bold', fontsize=20)
axes = axes.flatten()
campaign = ['AcceptedCmp1','AcceptedCmp2','AcceptedCmp3','AcceptedCmp4', 'AcceptedCmp5','AcceptedCmp6']
_barplot_cluster(df, campaign)
plt.tight_layout()
Cluster 4 responds most strongly to the campaigns.
Cluster 3 responds most weakly to the campaigns.
Campaign 3 attracts the most participation across clusters.
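Because the `AcceptedCmpN` columns are 0/1 flags, the per-cluster mean is directly the acceptance rate, which makes these comparisons easy to quantify. A sketch with two campaign columns and toy data standing in for `df`:

```python
# Sketch: campaign acceptance rates per cluster from 0/1 acceptance flags.
# Toy data — replace df_toy with the notebook's df and full campaign list.
import pandas as pd

df_toy = pd.DataFrame({
    'Cluster':      [0, 0, 1, 1, 4, 4],
    'AcceptedCmp1': [1, 0, 0, 0, 1, 1],
    'AcceptedCmp3': [0, 1, 1, 0, 1, 1],
})

rates = df_toy.groupby('Cluster')[['AcceptedCmp1', 'AcceptedCmp3']].mean()
print(rates)  # each cell is the fraction of that cluster accepting the campaign
```

Comparing rows of `rates` identifies the most and least responsive segments without reading values off the bar charts.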
plt.figure(figsize=(9,4))
plt.title('Average Number of Purchases Made with a Discount\nby Clusters', weight='bold', fontsize=16)
sns.barplot(data=df, x='Cluster', y='NumDealsPurchases', ci=None, palette='copper_r')
<Axes: title={'center': 'Average Number of Purchases Made with a Discount\nby Clusters'}, xlabel='Cluster', ylabel='NumDealsPurchases'>
Cluster 2 is the most responsive to discount campaigns: the campaigns it participated in involved the highest discount levels.
Most customers have average income levels, lower spending habits, larger families, and are generally not influenced by marketing campaigns.
Customers with higher incomes spend more and respond better to marketing initiatives.
The business should strengthen its customer retention programs to encourage repeat purchases.
It is also essential to reassess the implemented programs and campaigns, evaluating their impact on each segment to identify the distinct factors driving each segment's purchasing behavior.